1 Scaling question answering to the web

Cody Kwok, Oren Etzioni, Daniel S. Weld
July 2001 ACM Transactions on Information Systems (TOIS), volume 19 issue 3 

Publisher: ACM Press 

Full text available: pdf(452.33 KB) Additional Information: full citation, abstract, references, index terms

The wealth of information on the web makes it an attractive resource for seeking quick
answers to simple, factual questions such as "who was the first American in space?"
or "what is the second tallest mountain in the world?" Yet today's most advanced web
search services (e.g., Google and AskJeeves) make it surprisingly tedious to locate
answers to such questions. In this paper, we extend question-answering techniques, first
studied in the information retrieval literature ...

Keywords: answer extraction, answer selection, natural language processing, query 
formulation, search engines 



2 Scaling question answering to the Web

Cody C. T. Kwok, Oren Etzioni, Daniel S. Weld
April 2001 Proceedings of the 10th international conference on World Wide Web

Publisher: ACM Press 

Full text available: pdf(294.27 KB) Additional Information: full citation, references, citings, index terms



3 Adapting content to mobile devices: DOM-based content extraction of HTML 
documents 

Suhit Gupta, Gail Kaiser, David Neistadt, Peter Grimm 

May 2003 Proceedings of the 12th international conference on World Wide Web 

Publisher: ACM Press 

Full text available: pdf(296.17 KB) Additional Information: full citation, abstract, references, citings, index terms

Web pages often contain clutter (such as pop-up ads, unnecessary images and 
extraneous links) around the body of an article that distracts a user from actual content. 
Extraction of "useful and relevant" content from web pages has many applications, 
including cell phone and PDA browsing, speech rendering for the visually impaired, and 
text summarization. Most approaches to removing clutter or making content more
readable involve changing font size or removing HTML and data components such as
imag ...

Keywords: DOM trees, HTML documents, accessibility, content extraction, reformatting, 
speech rendering 



4 Integrating the document object model with hyperlinks for enhanced topic distillation
and information extraction

Soumen Chakrabarti

April 2001 Proceedings of the 10th international conference on World Wide Web 

Publisher: ACM Press 

Full text available: pdf(369.63 KB) Additional Information: full citation, references, citings, index terms




Keywords: document object model, minimum description length principle, segmentation, 
topic distillation 



5 Web mining, tools, and performance evaluation: Web classification using support
vector machine

Aixin Sun, Ee-Peng Lim, Wee-Keong Ng 

November 2002 Proceedings of the 4th international workshop on Web information 
and data management 

Publisher: ACM Press

Full text available: pdf(327.09 KB) Additional Information: full citation, abstract, references, citings, index terms

In web classification, web pages from one or more web sites are assigned to pre-defined 
categories according to their content. Since web pages are more than just plain text 
documents, web classification methods have to consider using other context features of 
web pages, such as hyperlinks and HTML tags. In this paper, we propose the use of 
Support Vector Machine (SVM) classifiers to classify web pages using both their text and 
context feature sets. We have experimented our web classification met ... 

Keywords: SVM, web classification, web mining 



6 Translation of web queries using anchor text mining

Wen-Hsiang Lu, Lee-Feng Chien, Hsi-Jian Lee

June 2002 ACM Transactions on Asian Language Information Processing (TALIP), volume 1 issue 2

Publisher: ACM Press

Full text available: pdf(290.79 KB) Additional Information: full citation, abstract, references, citings, index terms

This article presents an approach to automatically extracting translations of Web query 
terms through mining of Web anchor texts and link structures. One of the existing 
difficulties in cross-language information retrieval (CLIR) and Web search is the lack of 
appropriate translations of new terminology and proper names. The proposed approach 
successfully exploits the anchor-text resources and reduces the existing difficulties of 
query term translation. Many query terms that cannot be obtained in ... 

Keywords: anchor text mining, comparable corpora, cross-language information 
retrieval, machine translation, parallel corpora, web mining



7 Scalable feature selection, classification and signature generation for organizing
large text databases into hierarchical topic taxonomies

Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, Prabhakar Raghavan 

August 1998 The VLDB Journal — The International Journal on Very Large Data 

Bases, volume 7 issue 3

Publisher: Springer-Verlag New York, Inc. 

Full text available: pdf(281.37 KB) Additional Information: full citation, abstract, citings, index terms

We explore how to organize large text databases hierarchically by topic to aid better 
searching, browsing and filtering. Many corpora, such as internet directories, digital 
libraries, and patent databases are manually organized into topic hierarchies, also called 
taxonomies. Similar to indices for relational data, taxonomies make search and access 
more efficient. However, the exponential growth in the volume of on-line textual 
information makes it nearly impossible to maintain such taxono ...

8 Multimedia and visualization: Dynamic structuring of web information for access
visualization



Jess Y. S. Mak, Hong Va Leong, Alvin T. S. Chan 

March 2002 Proceedings of the 2002 ACM symposium on Applied computing 
Publisher: ACM Press 

Full text available: pdf(765.23 KB) Additional Information: full citation, abstract, references, index terms

The Internet has led to the formation of a global information infrastructure. To explore a 
web site, a site map would be useful as a short cut for a user to locate the target
information in a structured and efficient manner, rather than drilling into the web site
following hyperlinks, reading possibly irrelevant information. Useless information impacts
a mobile web environment, where mobile clients are only connected with unreliable 
wireless channels of limited bandwidth. Structured web page ... 

Keywords: DOM, VRML, XML, visualization, web document structure



9 Writing the web: Mining topic-specific concepts and definitions on the web 
Bing Liu, Chee Wee Chin, Hwee Tou Ng

May 2003 Proceedings of the 12th international conference on World Wide Web
Publisher: ACM Press 

Full text available: pdf(245.66 KB) Additional Information: full citation, abstract, references, citings, index terms

Traditionally, when one wants to learn about a particular topic, one reads a book or a 
survey paper. With the rapid expansion of the Web, learning in-depth knowledge about a
topic from the Web is becoming increasingly important and popular. This is also due to
the Web's convenience and its richness of information. In many cases, learning from the
Web may even be essential because in our fast changing world, emerging topics appear 
constantly and rapidly. There is often not enough time for someone ... 

Keywords: definition mining, domain concept mining, information integration, knowledge
compilation, web content mining 



10 Web-based simulation: Web-based simulation 3: re-introducing web-based simulation
Steven W. Reichenthal 

December 2002 Proceedings of the 34th conference on Winter simulation: exploring 
new frontiers 
Publisher: Winter Simulation Conference

Full text available: pdf(184.66 KB) Additional Information: full citation, abstract, references

This paper re-introduces web-based simulation from a web development point of view by
first comparing the goals, structures, operations, and communication mechanisms on the
web with those of current distributed simulation technology, and then synthesizing a new 
web-based simulation paradigm that more closely resembles the technology found on the 
web than Java-HLA solutions. The resulting paradigm is expressed through the Simulation 
Reference Markup Language (SRML) and Simulation Reference Sim ... 

11 Rank aggregation methods for the Web 

Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar 
April 2001 Proceedings of the 10th international conference on World Wide Web

Publisher: ACM Press

Full text available: pdf(288.25 KB) Additional Information: full citation, references, citings, index terms



Keywords: metasearch, multi-word queries, rank aggregation, ranking functions, spam 



12 Open hypermedia and the web: The XML web: a first study

Laurent Mignet, Denilson Barbosa, Pierangelo Veltri

May 2003 Proceedings of the 12th international conference on World Wide Web
Publisher: ACM Press

Full text available: pdf(726.59 KB) Additional Information: full citation, abstract, references, citings, index terms

Although originally designed for large-scale electronic publishing, XML plays an 
increasingly important role in the exchange of data on the Web. In fact, it is expected that 
XML will become the lingua franca of the Web, eventually replacing HTML. Not
surprisingly, there has been a great deal of interest on XML both in industry and in 
academia. Nevertheless, to date no comprehensive study on the XML Web (i.e., the 
subset of the Web made of XML documents only) nor on its contents has been made. 
Th ... 

Keywords: XML documents, XML web, statistical analysis, structural properties 



13 Crawling: Accelerated focused crawling through online relevance feedback 
Soumen Chakrabarti, Kunal Punera, Mallela Subramanyam

May 2002 Proceedings of the 11th international conference on World Wide Web
Publisher: ACM Press

Full text available:
Additional Information: full citation, abstract, references, citings, index terms

The organization of HTML into a tag tree structure, which is rendered by browsers as 
roughly rectangular regions with embedded text and HREF links, greatly helps surfers 
locate and click on links that best satisfy their information need. Can an automatic 
program emulate this human behavior and thereby learn to predict the relevance of an 
unseen HREF target page w.r.t. an information need, based on information limited to the 
HREF source page? Such a capability would be of great interest in focuse ... 

Keywords: document object model, focused crawling, reinforcement learning



14 Local versus global link information in the Web
Pável Calado, Berthier Ribeiro-Neto, Nivio Ziviani, Edleno Moura, Ilmério Silva
January 2003 ACM Transactions on Information Systems (TOIS), volume 21 issue 1
Publisher: ACM Press 

Full text available: pdf(413.06 KB) Additional Information: full citation, abstract, references, citings, index terms

Information derived from the cross-references among the documents in a hyperlinked 
environment, usually referred to as link information, is considered important since it can
be used to effectively improve document retrieval. Depending on the retrieval strategy,
link information can be local or global. Local link information is derived from the set of
documents returned as answers to the current user query. Global link information is
derived from all the documents in the collection. In th ...

Keywords: Belief networks, World Wide Web, link analysis, local and global information 



15 Enhanced topic distillation using text, markup tags, and hyperlinks 
Soumen Chakrabarti, Mukul Joshi, Vivek Tawde

September 2001 Proceedings of the 24th annual international ACM SIGIR conference
on Research and development in information retrieval 

Publisher: ACM Press 

Full text available: pdf(386.22 KB) Additional Information: full citation, abstract, references, citings, index terms

Topic distillation is the analysis of hyperlink graph structure to identify mutually 
reinforcing authorities (popular pages) and hubs (comprehensive lists of links to 
authorities). Topic distillation is becoming common in Web search engines, but the best- 
known algorithms model the Web graph at a coarse grain, with whole pages as single 
nodes. Such models may lose vital details in the markup tag structure of the pages, and 
thus lead to a tightly linked irr ... 

16 Mining the network value of customers

Pedro Domingos, Matt Richardson
August 2001 Proceedings of the seventh ACM SIGKDD international conference on
Knowledge discovery and data mining 

Publisher: ACM Press 

Full text available: pdf(999.05 KB) Additional Information: full citation, abstract, references, citings, index terms

One of the major applications of data mining is in helping companies determine which 
potential customers to market to. If the expected profit from a customer is greater than 
the cost of marketing to her, the marketing action for that customer is executed. So far, 
work in this area has considered only the intrinsic value of the customer (i.e., the expected
profit from sales to her). We propose to model also the customer's network value: the 
expected profit from sales to other customers she ... 

Keywords: Markov random fields, collaborative filtering, dependency networks, direct 
marketing, social networks, viral marketing 



17 Link Analysis: Improvement of HITS-based algorithms on web documents 
Longzhuang Li, Yi Shang, Wei Zhang 

May 2002 Proceedings of the 11th international conference on World Wide Web 
Publisher: ACM Press

Full text available: pdf(214.35 KB) Additional Information: full citation, abstract, references, citings, index terms

In this paper, we present two ways to improve the precision of HITS-based algorithms on 
Web documents. First, by analyzing the limitations of current HITS-based algorithms, we
propose a new weighted HITS-based method that assigns appropriate weights to in-links
of root documents. Then, we combine content analysis with HITS-based algorithms and 
study the effects of four representative relevance scoring methods, VSM, Okapi, TLS, 
and CDR, using a set of broad topic queries. Our experi ... 

Keywords: HITS-based algorithms, information retrieval, relevance scoring methods 



18 Cover story: structural Web search using a graph-based discovery system

Nitish Manocha, Diane J. Cook, Lawrence B. Holder
March 2001 intelligence, volume 12 issue 1

Publisher: ACM Press

Full text available: pdf(194.01 KB), html(31.20 KB) Additional Information: full citation, references, citings, index terms



19 Evaluating topic-driven web crawlers 

Filippo Menczer, Gautam Pant, Padmini Srinivasan, Miguel E. Ruiz

September 2001 Proceedings of the 24th annual international ACM SIGIR conference
on Research and development in information retrieval
Publisher: ACM Press

Full text available:
Additional Information: full citation, abstract, references, citings, index terms

Due to limited bandwidth, storage, and computational resources, and to the dynamic 
nature of the Web, search engines cannot index every Web page, and even the covered 
portion of the Web cannot be monitored continuously for changes. Therefore it is essential
to develop effective crawling strategies to prioritize the pages to be indexed. The issue is 
even more important for topic-specific search engines, where crawlers must make 
additional decisions based on the relevance of visited pages. ... 

Keywords: InfoSpiders, PageRank, Web information retrieval, best-first search, focused 
crawlers, performance metrics, topic driven crawling 



20 Data integrity: Web application security assessment by fault injection and behavior
monitoring

Yao-Wen Huang, Shih-Kun Huang, Tsung-Po Lin, Chung-Hung Tsai

May 2003 Proceedings of the 12th international conference on World Wide Web 

Publisher: ACM Press 

Full text available: pdf(4.53 MB) Additional Information: full citation, abstract, references, citings, index terms

As a large and complex application platform, the World Wide Web is capable of delivering 
a broad range of sophisticated applications. However, many Web applications go through 
rapid development phases with extremely short turnaround time, making it difficult to 
eliminate vulnerabilities. Here we analyze the design of Web application security 
assessment mechanisms in order to identify poor coding practices that render Web 
applications vulnerable to attacks such as SQL injection and cross-site scr ... 

Keywords: black-box testing, complete crawling, fault injection, security assessment, 
web application testing 